Vector datasets catalog and downloader by connortsui20 · Pull Request #7446 · vortex-data/vortex

connortsui20 · 2026-04-15T15:15:07Z

Summary

Tracking issue: #7297

We will want to add vector benchmarking soon (see #7399 for a draft).

This adds a simple catalog for the vector datasets hosted at https://assets.zilliz.com/benchmark for VectorDBBench. The catalog describes the shape of each dataset (whether it is partitioned, whether it is randomly shuffled, whether there are neighbor lists for top-k, etc.).

It also handles downloading everything.
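
For readers who want a concrete picture of what such a catalog entry could look like, here is a minimal sketch. It is not the code in this PR: `Tier`, `DatasetEntry`, `parse_dir`, and `file_url` are hypothetical names, the HTTPS path layout is assumed to mirror the S3 prefix, and the file name passed to `file_url` is left to the caller because the actual file names are not listed in this description.

```rust
/// Size tier encoded in the directory names (e.g. `cohere_medium_1m`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    Small,
    Medium,
    Large,
}

/// Assumed HTTPS prefix, taken from the URL mentioned in the description.
pub const BASE_URL: &str = "https://assets.zilliz.com/benchmark";

/// Hypothetical catalog entry; field names are illustrative only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DatasetEntry {
    /// Directory name under the benchmark prefix.
    pub dir: &'static str,
    /// Whether the vectors are split across several files.
    pub partitioned: bool,
    /// Whether a randomly shuffled variant exists.
    pub shuffled: bool,
    /// Whether top-k neighbor lists are available.
    pub has_neighbors: bool,
}

impl DatasetEntry {
    /// Builds a download URL for a file inside this dataset's directory.
    /// The `{base}/{dir}/{file}` layout is an assumption.
    pub fn file_url(&self, file_name: &str) -> String {
        format!("{}/{}/{}", BASE_URL, self.dir, file_name)
    }
}

/// Splits a directory name such as `openai_small_50k` into
/// (source, tier, row count). Returns `None` for unexpected shapes.
pub fn parse_dir(dir: &str) -> Option<(&str, Tier, u64)> {
    let mut parts = dir.split('_');
    let source = parts.next()?;
    let tier = match parts.next()? {
        "small" => Tier::Small,
        "medium" => Tier::Medium,
        "large" => Tier::Large,
        _ => return None,
    };
    let size = parts.next()?;
    if parts.next().is_some() {
        return None;
    }
    // The size suffix is `k` (thousands) or `m` (millions).
    let (digits, multiplier) = if let Some(d) = size.strip_suffix('k') {
        (d, 1_000u64)
    } else if let Some(d) = size.strip_suffix('m') {
        (d, 1_000_000u64)
    } else {
        return None;
    };
    let count = digits.parse::<u64>().ok()?.checked_mul(multiplier)?;
    Some((source, tier, count))
}
```

Deriving the tier and row count from the directory name keeps the hand-written part of the catalog down to the flags that had to be checked against the bucket.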

I had to verify that all of this was correct by listing the S3 bucket directly:

aws s3 ls s3://assets.zilliz.com/benchmark/ --region us-west-2 --no-sign-request

Details

for d in bioasq_large_10m bioasq_medium_1m cohere_large_10m cohere_medium_1m \
         cohere_small_100k gist_medium_1m gist_small_100k glove_medium_1m \
         glove_small_100k laion_large_100m  \
         openai_large_5m openai_medium_500k openai_small_50k \
         sift_large_50m sift_medium_5m sift_small_500k; do
  echo "=== $d ==="
  aws s3 ls s3://assets.zilliz.com/benchmark/$d/ --region us-west-2 --no-sign-request
done

And this script from the main repo helped too: https://github.com/zilliztech/VectorDBBench/blob/main/vectordb_bench/backend/dataset.py

Things that are not implemented that I would like to add:

Is the dataset pre-normalized for cosine similarity? This is not so obvious to me without actually working with the datasets, so I will do this later.
Some datasets have scalar labels for all vectors, which help mimic a similarity search combined with a filter on some other column. Some of them also have neighbor lists for these specific filtered queries, so that is something we'll probably want to add in the future.
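
The first open question above (whether a dataset is pre-normalized) could be answered by sampling some vectors and checking their L2 norms. The sketch below shows one way to do that, assuming the sample has already been loaded into memory as `Vec<f32>` rows; `l2_norm`, `looks_normalized`, and the tolerance are hypothetical and not part of this PR.

```rust
/// L2 norm of a vector, accumulated in f64 to limit rounding error.
pub fn l2_norm(v: &[f32]) -> f64 {
    v.iter()
        .map(|&x| f64::from(x) * f64::from(x))
        .sum::<f64>()
        .sqrt()
}

/// Returns true if every sampled vector has unit length within `tol`.
/// An empty sample trivially returns true, so callers should check
/// that the sample is non-empty before trusting the result.
pub fn looks_normalized(sample: &[Vec<f32>], tol: f64) -> bool {
    sample.iter().all(|v| (l2_norm(v) - 1.0).abs() <= tol)
}
```

If a dataset passes this check, cosine similarity reduces to a plain dot product, which is why the flag would be worth recording in the catalog.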

Testing

N/A

codspeed-hq · 2026-04-15T15:22:57Z

Merging this PR will not alter performance

✅ 1153 untouched benchmarks
⏩ 1455 skipped benchmarks¹

Comparing ct/vector-datasets (5d70dd0) with develop (9406303)

¹ 1455 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, they can be archived to remove them from the performance reports.

connortsui20 · 2026-04-15T16:08:47Z

We probably want to mirror this somewhere. @AdamGS @robert3005 is there an easy way to do this?

AdamGS · 2026-04-15T16:23:26Z

R2 is probably the easiest? Whatever we use for the clickbench data

AdamGS · 2026-04-15T16:24:48Z

+}
+
+/// Stream a large file to disk with a byte-progress bar.
+async fn download_with_progress(client: &Client, url: &str, output: &PathBuf) -> Result<()> {


Why do we need another one of these? Can't this be part of the general download utils we have here?

(this function and a bunch of the following ones)

maybe? But afaict we don't use a reqwest client anywhere else

we definitely do
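
For context on what "stream a large file to disk with a byte-progress bar" involves, here is a minimal sketch of the core loop using only the standard library. It is not the PR's `download_with_progress`: the real function takes a reqwest `Client` and is async, whereas this hypothetical `copy_with_progress` works over any blocking `Read` and `Write` and reports progress through a callback instead of a progress bar.

```rust
use std::io::{self, Read, Write};

/// Copies `reader` into `writer` in fixed-size chunks, calling
/// `on_progress` with the running byte total after each chunk.
/// Returns the total number of bytes copied.
pub fn copy_with_progress<R: Read, W: Write>(
    mut reader: R,
    mut writer: W,
    mut on_progress: impl FnMut(u64),
) -> io::Result<u64> {
    let mut buf = [0u8; 64 * 1024];
    let mut total = 0u64;
    loop {
        let n = match reader.read(&mut buf) {
            Ok(0) => break, // end of stream
            Ok(n) => n,
            // Retry reads that were interrupted by a signal.
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        writer.write_all(&buf[..n])?;
        total += n as u64;
        on_progress(total);
    }
    writer.flush()?;
    Ok(total)
}
```

Keeping the loop generic over `Read` and `Write` is one way a shared download helper could serve several callers, which is the kind of unification the thread above is asking about.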

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>

AdamGS

overall, nothing objectionable here, let's ship it

connortsui20 · 2026-04-15T18:23:19Z

OK, I'm yoloing this since it doesn't affect anyone else. At some point it would be good to unify the downloading, but if we are going to do that then we might as well implement the catalog idea that @joseph-isaacs had.

AdamGS · 2026-04-15T18:39:59Z

@joseph-isaacs what's the catalog idea? worth writing down somewhere?

## Summary

Tracking issue: #7297

Adds a TurboQuant demo where we convert the Parquet files to a Vortex file (in-memory only for now, but still serialized as bytes), and then verify by decoding and performing a basic cosine similarity expression search with a filter pushdown.

This is based on top of #7446, please don't merge until that has merged.

## Testing

The example runs!

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
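
To make "cosine similarity search with a filter" concrete, here is a brute-force sketch over in-memory rows. It does not use Vortex or TurboQuant and only mimics the idea of applying the filter before scoring; `Row`, `cosine_similarity`, and `filtered_top_k` are hypothetical names introduced for illustration.

```rust
/// One row with a scalar label and an embedding (illustrative layout).
pub struct Row {
    pub label: u32,
    pub vector: Vec<f32>,
}

/// Cosine similarity of two vectors, or `None` if the lengths differ
/// or either vector has zero norm.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    let (mut dot, mut na, mut nb) = (0f64, 0f64, 0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    if na == 0.0 || nb == 0.0 {
        return None;
    }
    Some((dot / (na.sqrt() * nb.sqrt())) as f32)
}

/// Keeps rows that pass `keep`, scores them against `query`, and
/// returns the `k` best as (row index, score), highest score first.
pub fn filtered_top_k(
    rows: &[Row],
    query: &[f32],
    k: usize,
    keep: impl Fn(&Row) -> bool,
) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = rows
        .iter()
        .enumerate()
        // Apply the filter first so rejected rows are never scored.
        .filter(|&(_, r)| keep(r))
        .filter_map(|(i, r)| cosine_similarity(&r.vector, query).map(|s| (i, s)))
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```

A real implementation would push the predicate into the scan rather than filter a materialized slice, but the ordering of the two steps is the same.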

connortsui20 requested review from AdamGS and robert3005 April 15, 2026 15:15

connortsui20 added the changelog/feature (A new feature) label Apr 15, 2026

connortsui20 force-pushed the ct/vector-datasets branch from 8c86302 to 434c57b Compare April 15, 2026 15:18

connortsui20 mentioned this pull request Apr 15, 2026

Tracking Issue: Vector Similarity Search #7297

Open

38 tasks

AdamGS reviewed Apr 15, 2026

Comment thread vortex-bench/src/vector_dataset/download.rs Outdated

connortsui20 force-pushed the ct/vector-datasets branch from 434c57b to 1a462ba Compare April 15, 2026 16:31

vector dataset catalog and downloader

5d70dd0

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>

connortsui20 force-pushed the ct/vector-datasets branch from 1a462ba to 5d70dd0 Compare April 15, 2026 18:10

AdamGS approved these changes Apr 15, 2026

connortsui20 mentioned this pull request Apr 15, 2026

Demo TurboQuant basic search with serialization #7451

Merged

connortsui20 requested a review from gatesn April 15, 2026 18:16

connortsui20 merged commit bff43dc into develop Apr 15, 2026
62 checks passed

connortsui20 deleted the ct/vector-datasets branch April 15, 2026 18:23

2 participants
